library("IRdisplay")
display_png(file="imdb.png", width = 1000)
1. Introduction
- Sentiment analysis is a tool for classifying people's impression of a product or topic.
- Using sentiment analysis, an algorithm can read a text written in natural language and map it to a scale ranging from positive to negative feelings.
- Because it is able to interpret human language, sentiment analysis is widely used in many online platforms.
- For instance, companies use sentiment analysis to gain direct feedback from customers about a product.
2. Overview of the project
- In this project, we create a sentiment analysis model to interpret movie reviews on the IMDB website.
- The data consists of 50,000 reviews, each with a score from 0 to 10.
- If the score is 4 or below, the review is classified as negative; if the score is 7 or higher, it is classified as positive (reviews with scores of 5 and 6 are excluded from this analysis).
- The goal of the algorithm is to guess if the review is negative or positive directly from the text written by the reviewer.
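The labeling rule above can be sketched as a small helper function (the function name is hypothetical, not part of the project code):

```r
# Map an IMDB score (0-10) to a sentiment label.
# Scores of 5 and 6 are excluded from the data set, so they return NA here.
score_to_sentiment <- function(score) {
  ifelse(score <= 4, 0L,
         ifelse(score >= 7, 1L, NA_integer_))
}

score_to_sentiment(c(2, 4, 5, 6, 7, 10))
```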
3. Data processing
The data set, located in the file "alldata.tsv", is a table with 50,000 rows and 4 columns. Each row represents a review, while the columns represent:
- id, the identification number of each review
- sentiment, 0 for negative and 1 for positive
- score, from 0 to 10, excluding 5 and 6
- review, the actual text written by the reviewer
The data processing consists of 2 main steps: (1) removing irrelevant symbols and (2) filtering the text using a vocabulary.
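A minimal sketch of step (1), stripping HTML remnants and punctuation with base R (the actual patterns used by the project may differ):

```r
# Lowercase the text, drop HTML tags such as <br />, drop punctuation,
# and collapse the leftover whitespace.
clean_review <- function(text) {
  text <- tolower(text)
  text <- gsub("<[^>]+>", " ", text)        # drop HTML remnants
  text <- gsub("[[:punct:]]+", " ", text)   # drop punctuation marks
  gsub("\\s+", " ", trimws(text))           # collapse repeated whitespace
}

clean_review("Great movie!<br /><br />A must-see.")
```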
4. Model
We use both R and Python to build our model. R is first used to identify the meaningful terms in the reviews. The vocabulary of extracted terms is then passed to Python to build the final model.
The detailed steps to build and test our classification model are:
1. Clean the data by removing punctuation marks and stop words.
2. From the cleaned reviews, build a vocabulary using all N-gram terms that appear, with N ranging from 1 to 4 (individual words up to sequences of 4 words).
3. Vectorize the reviews using count vectorization of the N-grams.
4. Reduce the vocabulary to about 1000 terms using Logistic Regression with Lasso regularization (main step done in R in this notebook).
5. Re-vectorize all reviews using the reduced vocabulary of N-grams found by Lasso.
6. Train a Neural Network on the review vectors (main step done in Python in the next notebook).
7. Evaluate the AUC of the Neural Network model on a test set.
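The N-gram extraction step above can be illustrated in base R; `text2vec` does this internally, so the following is only a sketch:

```r
# Build all N-grams (N = 1..4) from a token vector, joined with "_"
# in the same style text2vec uses for its vocabulary terms.
ngrams <- function(tokens, n_max = 4) {
  out <- character(0)
  for (n in seq_len(min(n_max, length(tokens)))) {
    for (i in seq_len(length(tokens) - n + 1)) {
      out <- c(out, paste(tokens[i:(i + n - 1)], collapse = "_"))
    }
  }
  out
}

ngrams(c("not", "a", "good", "movie"))
```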
In the next few sections, we walk through all the computations done in this project.
Part I. Vocabulary Reduction using R
library("text2vec")
library("glmnet")
Generating the vocabulary
First, we load the data and take a look at it:
train = read.table("alldata.tsv",
stringsAsFactors = FALSE,
header = TRUE)
head(train,n = 2L)
| id | sentiment | score | review |
|---|---|---|---|
| &lt;int&gt; | &lt;int&gt; | &lt;int&gt; | &lt;chr&gt; |
| 1 | 1 | 10 | Naturally in a film who's main themes are of mortality, nostalgia, and loss of innocence it is perhaps not surprising that it is rated more highly by older viewers than younger ones. However there is a craftsmanship and completeness to the film which anyone can enjoy. The pace is steady and constant, the characters full and engaging, the relationships and interactions natural showing that you do not need floods of tears to show emotion, screams to show fear, shouting to show dispute or violence to show anger. Naturally Joyce's short story lends the film a ready made structure as perfect as a polished diamond, but the small changes Huston makes such as the inclusion of the poem fit in neatly. It is truly a masterpiece of tact, subtlety and overwhelming beauty. |
| 2 | 0 | 2 | This movie is a disaster within a disaster film. It is full of great action scenes, which are only meaningful if you throw away all sense of reality. Let's see, word to the wise, lava burns you; steam burns you. You can't stand next to lava. Diverting a minor lava flow is difficult, let alone a significant one. Scares me to think that some might actually believe what they saw in this movie.&lt;br /&gt;&lt;br /&gt;Even worse is the significant amount of talent that went into making this film. I mean the acting is actually very good. The effects are above average. Hard to believe somebody read the scripts for this and allowed all this talent to be wasted. I guess my suggestion would be that if this movie is about to start on TV ... look away! It is like a train wreck: it is so awful that once you know what is coming, you just have to watch. Look away and spend your time on more meaningful content. |
Next, remove the stop words given in the list below:
stop_words = c("i", "me", "my", "myself",
"we", "our", "ours", "ourselves",
"you", "your", "yours",
"their", "they", "his", "her",
"she", "he", "a", "an", "and",
"is", "was", "are", "were",
"him", "himself", "has", "have",
"it", "its", "the", "us")
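Filtering tokens against such a list can be sketched with base R (with `text2vec`, the `stopwords` argument of `create_vocabulary()` below does this for us; the helper name here is made up):

```r
# Keep only the tokens that are not in the stop-word list.
keep_tokens <- function(tokens, stop_words) {
  tokens[!tokens %in% stop_words]
}

keep_tokens(c("it", "is", "a", "truly", "great", "movie"),
            c("it", "is", "a"))
```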
it_train = itoken(train$review,
preprocessor = tolower,
tokenizer = word_tokenizer)
tmp.vocab = create_vocabulary(it_train,
stopwords = stop_words,
ngram = c(1L,4L))
tmp.vocab = prune_vocabulary(tmp.vocab, term_count_min = 10,
doc_proportion_max = 0.5,
doc_proportion_min = 0.001)
dtm_train = create_dtm(it_train, vocab_vectorizer(tmp.vocab))
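The document-term matrix built here holds term counts per review. A toy version in base R, using a hypothetical two-review corpus, shows the idea:

```r
# Two tokenized "reviews" and a tiny vocabulary (illustrative data only).
docs <- list(c("great", "movie", "great"),
             c("bad", "movie"))
vocab <- c("bad", "great", "movie")

# Count how often each vocabulary term appears in each document:
# one row per document, one column per term.
dtm <- t(sapply(docs, function(d) {
  vapply(vocab, function(term) sum(d == term), integer(1))
}))
dtm
```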
Now, use Logistic Regression with Lasso regularization to reduce the number of terms in the vocabulary:
set.seed(3213)
tmpfit = glmnet(x = dtm_train,
y = train$sentiment,
alpha = 1,
family='binomial')
tmpfit$df
0, 1, 2, 3, 4, 4, 6, 7, 11, 15, 18, 22, 25, 39, 48, 57, 67, 83, 97, 114, 131, 153, 174, 206, 238, 270, 303, 338, 389, 437, 489, 561, 643, 740, 859, 982, 1126, 1281, 1471, 1711, 1962, 2262, 2585, 2934, 3273, 3667, 4087, 4497, 4890, 5321, 5735, 6158, 6587, 7018, 7400, 7747, 8086, 8447, 8778, 9054, 9376, 9646, 9886, 10117, 10354, 10574, 10760, 10983, 11150, 11286, 11421, 11529, 11666, 11776, 11860, 11965, 12053, 12142, 12232, 12352, 12419, 12495, 12542, 12594, 12631, 12663, 12705, 12738, 12776, 12914, 12882, 12994, 13038, 13076, 13115, 13141, 13190, 13212, 13253, 13270
Then, take the largest vocabulary size that is still below 2000 terms (position 41, df = 1962):
myvocab = colnames(dtm_train)[which(tmpfit$beta[, 41] != 0)]
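Position 41 was chosen by scanning `tmpfit$df`, which holds the number of nonzero coefficients at each lambda along the regularization path. The selection logic can be sketched generically (the `df` vector below is a made-up slice, not the actual fit):

```r
# Pick the last position on the lambda path whose vocabulary size
# is still below 2000 (illustrative values only).
df <- c(0, 150, 982, 1471, 1962, 2262, 2585)
pos <- max(which(df < 2000))
pos
df[pos]
```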
Now, use this smaller vocabulary to run another Lasso step:
it_train = itoken(train$review,
preprocessor = tolower,
tokenizer = word_tokenizer)
vectorizer = vocab_vectorizer(create_vocabulary(myvocab,
ngram = c(1L, 2L)))
dtm_train = create_dtm(it_train, vectorizer)
set.seed(3213)
tmpfit = glmnet(x = dtm_train,
y = train$sentiment,
alpha = 1,
family='binomial')
tmpfit$df
0, 1, 2, 3, 4, 4, 6, 7, 11, 14, 17, 21, 24, 39, 47, 55, 66, 83, 97, 111, 127, 149, 169, 197, 225, 261, 287, 324, 367, 412, 454, 527, 589, 675, 774, 875, 1005, 1131, 1287, 1452, 1572, 1630, 1659, 1672, 1682, 1689, 1693, 1698, 1699, 1701, 1703, 1706, 1709, 1711, 1715, 1720, 1722, 1728, 1731, 1733, 1736, 1740, 1740, 1743, 1743, 1745, 1746, 1750, 1750, 1751, 1754, 1757, 1758, 1760, 1763, 1765, 1767, 1768, 1769, 1770, 1770, 1773, 1773, 1773, 1776, 1776, 1777, 1778, 1778, 1780, 1782, 1783, 1783
Finally, we take the df closest to 1000 (position 37, df = 1005):
myvocab2 = colnames(dtm_train)[which(tmpfit$beta[, 37] != 0)]
Then, export the terms in this vocabulary to a CSV file:
write.csv(myvocab2, "myvocab2.csv")
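As a sanity check, the exported file can be read back before handing it to the Python notebook. A sketch using a temporary file rather than the real vocabulary:

```r
# Write a toy vocabulary and read it back the same way the
# downstream notebook will consume it (illustrative terms only).
vocab <- c("great_movie", "waste_of_time", "masterpiece")
path <- tempfile(fileext = ".csv")
write.csv(vocab, path, row.names = FALSE)

back <- read.csv(path, stringsAsFactors = FALSE)
back$x   # write.csv stores a bare character vector under the default column name "x"
```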